We are a group of 4 members:
Aishwarya Sinhasane - avsinhas@iu.edu (In picture, Left top)
Himanshu Joshi - hsjoshi@iu.edu (In picture, Right bottom)
Sreelaxmi Chakkadath - schakkad@iu.edu (In picture, Left bottom)
Sumitha Vellinalur Thattai - svtranga@iu.edu (In picture, Right top)
Object detection, one of the basic tasks of computer vision, deals with classifying the image based on its content. In recent years, the approach of object detection has evolved rapidly and is being embraced in all fields ranging from healthcare to automotive industries.
The objective of our project is to classify images as either dogs or cats. Additionally, we also plan to find where the cat/dog is in the image. Although the task is simple to human eyes, computers find it hard to distinguish between images because of a plethora of factors including cluttered background, illumination conditions, deformations, occlusions among several others. We plan to build an end-to-end machine learning model which will help computers differentiate between cat and dog images with better accuracy.
The data that we plan to use is the CaDoD Kaggle data set. It contains a total of ~13k images of dogs and cats. We have used Stochastic Gradient Descent, Adaboost, and Gradient boosting to predict the image class. Gradient boosting model gives this highest accuracy and henceforth will be used as our baseline model. Furthermore, linear regression was used for the prediction of the position of cats and dogs.
Accuracy and Mean F1 score (harmonic mean of recall and precision) are used as evaluation parameters
Furthermore, we plan to improve the accuracy by implementing a CNN for classifying images and using RIdge/Lasso regression for boundary prediction. In addition to this, we plan to tune the hyperparameter by using GridsearchCV/RandomsearchCV
We have completed the following task this week:
Data Preprocessing – Image Rescaling and Augmentation
Built SGD classifier, Adaboost, and Gradient boost for image classification
Built linear regression model for boundary detection
Compare the performance of the above models using Accuracy and F1 score and chose the baseline model
Implemented homegrown logistic regression
The data we plan to use is the Kaggle data set. We will be using two files – one for image and the other for boundary:
The images are taken from the cadod.tar.gz
The boundary information of the images is from cadod.csv file
Image information (cadod.tar.gz):
There are ~13K images of various sizes and aspect ratios. Majority of the images are of 512 X 384 size
All the images are in RGB scale
The boundaries of the images are stored in the cadod.csv file
Attributes of the Boundary File (cadod.csv):
This has information about the image box coordinates
There are about 20 features in this data set
-15 numerical features
- This includes the image ID, the coordinates of the boundary boxes, and also the normalized coordinates of the boundary boxes
-5 categorical features
-This gives information about the occlusion, depiction, truncation, etc.
The following are the end-to-end task to achieve the results:
Image pre-processing (Rescaling and Augmentation)
Use SKlearn packages to build SGD Classifier, Adaboost and Gradientboost
Build a linear regression for box detection
Implement a homegrown logistic regression model
Since the data set is very large, we have carried out the above tasks were carried out only in a subset of data. Hence the results will be only directional and not accurate
from collections import Counter
import glob
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
from PIL import Image
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import SGDClassifier, SGDRegressor
from sklearn.metrics import accuracy_score, mean_squared_error, roc_auc_score
from sklearn.model_selection import train_test_split
import tarfile
from tqdm import tqdm
import warnings
import seaborn as sns
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier,GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
def extract_tar(file, path):
"""
function to extract tar.gz files to specified location
Args:
file (str): path where the file is located
path (str): path where you want to extract
"""
with tarfile.open(file) as tar:
files_extracted = 0
for member in tqdm(tar.getmembers()):
if os.path.isfile(path + member.name[1:]):
continue
else:
tar.extract(member, path)
files_extracted += 1
tar.close()
if files_extracted < 3:
print('Files already exist')
path = 'images/'
extract_tar('cadod.tar.gz', path)
100%|██████████| 25936/25936 [00:00<00:00, 95863.30it/s]
Files already exist
df = pd.read_csv('cadod.csv')
df.head()
| ImageID | Source | LabelName | Confidence | XMin | XMax | YMin | YMax | IsOccluded | IsTruncated | ... | IsDepiction | IsInside | XClick1X | XClick2X | XClick3X | XClick4X | XClick1Y | XClick2Y | XClick3Y | XClick4Y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0000b9fcba019d36 | xclick | /m/0bt9lr | 1 | 0.165000 | 0.903750 | 0.268333 | 0.998333 | 1 | 1 | ... | 0 | 0 | 0.636250 | 0.903750 | 0.748750 | 0.165000 | 0.268333 | 0.506667 | 0.998333 | 0.661667 |
| 1 | 0000cb13febe0138 | xclick | /m/0bt9lr | 1 | 0.000000 | 0.651875 | 0.000000 | 0.999062 | 1 | 1 | ... | 0 | 0 | 0.312500 | 0.000000 | 0.317500 | 0.651875 | 0.000000 | 0.410882 | 0.999062 | 0.999062 |
| 2 | 0005a9520eb22c19 | xclick | /m/0bt9lr | 1 | 0.094167 | 0.611667 | 0.055626 | 0.998736 | 1 | 1 | ... | 0 | 0 | 0.487500 | 0.611667 | 0.243333 | 0.094167 | 0.055626 | 0.226296 | 0.998736 | 0.305942 |
| 3 | 0006303f02219b07 | xclick | /m/0bt9lr | 1 | 0.000000 | 0.999219 | 0.000000 | 0.998824 | 1 | 1 | ... | 0 | 0 | 0.508594 | 0.999219 | 0.000000 | 0.478906 | 0.000000 | 0.375294 | 0.720000 | 0.998824 |
| 4 | 00064d23bf997652 | xclick | /m/0bt9lr | 1 | 0.240938 | 0.906183 | 0.000000 | 0.694286 | 0 | 0 | ... | 0 | 0 | 0.678038 | 0.906183 | 0.240938 | 0.522388 | 0.000000 | 0.370000 | 0.424286 | 0.694286 |
5 rows × 21 columns
The data set contains a total of 12,966 images of dogs and cats. Statistics:
cadod.csv.in which there are 12966 rows and 21 columns. print(f"There are a total of {len(glob.glob1(path, '*.jpg'))} images")
There are a total of 12966 images
print(f"The total size is {os.path.getsize(path)/1000} MB")
The total size is 830.016 MB
df.shape
(12966, 21)
Replace LabelName with human readable labels
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 12966 entries, 0 to 12965 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ImageID 12966 non-null object 1 Source 12966 non-null object 2 LabelName 12966 non-null object 3 Confidence 12966 non-null int64 4 XMin 12966 non-null float64 5 XMax 12966 non-null float64 6 YMin 12966 non-null float64 7 YMax 12966 non-null float64 8 IsOccluded 12966 non-null int64 9 IsTruncated 12966 non-null int64 10 IsGroupOf 12966 non-null int64 11 IsDepiction 12966 non-null int64 12 IsInside 12966 non-null int64 13 XClick1X 12966 non-null float64 14 XClick2X 12966 non-null float64 15 XClick3X 12966 non-null float64 16 XClick4X 12966 non-null float64 17 XClick1Y 12966 non-null float64 18 XClick2Y 12966 non-null float64 19 XClick3Y 12966 non-null float64 20 XClick4Y 12966 non-null float64 dtypes: float64(12), int64(6), object(3) memory usage: 2.1+ MB
df.LabelName.replace({'/m/01yrx':'cat', '/m/0bt9lr':'dog'}, inplace=True)
df.isnull().sum()
ImageID 0 Source 0 LabelName 0 Confidence 0 XMin 0 XMax 0 YMin 0 YMax 0 IsOccluded 0 IsTruncated 0 IsGroupOf 0 IsDepiction 0 IsInside 0 XClick1X 0 XClick2X 0 XClick3X 0 XClick4X 0 XClick1Y 0 XClick2Y 0 XClick3Y 0 XClick4Y 0 dtype: int64
df.LabelName.value_counts()
dog 6855 cat 6111 Name: LabelName, dtype: int64
df.LabelName.value_counts().plot(kind='bar')
plt.title('Image Class Count')
plt.show()
df.describe()
| Confidence | XMin | XMax | YMin | YMax | IsOccluded | IsTruncated | IsGroupOf | IsDepiction | IsInside | XClick1X | XClick2X | XClick3X | XClick4X | XClick1Y | XClick2Y | XClick3Y | XClick4Y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 12966.0 | 12966.000000 | 12966.000000 | 12966.000000 | 12966.000000 | 12966.000000 | 12966.000000 | 12966.000000 | 12966.000000 | 12966.000000 | 12966.000000 | 12966.000000 | 12966.000000 | 12966.000000 | 12966.000000 | 12966.000000 | 12966.000000 | 12966.000000 |
| mean | 1.0 | 0.099437 | 0.901750 | 0.088877 | 0.945022 | 0.464754 | 0.738470 | 0.013651 | 0.045427 | 0.001157 | 0.390356 | 0.424582 | 0.494143 | 0.506689 | 0.275434 | 0.447448 | 0.641749 | 0.582910 |
| std | 0.0 | 0.113023 | 0.111468 | 0.097345 | 0.081500 | 0.499239 | 0.440011 | 0.118019 | 0.209354 | 0.040229 | 0.358313 | 0.441751 | 0.405033 | 0.462281 | 0.415511 | 0.401580 | 0.448054 | 0.403454 |
| min | 1.0 | 0.000000 | 0.408125 | 0.000000 | 0.451389 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 |
| 25% | 1.0 | 0.000000 | 0.830625 | 0.000000 | 0.910000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.221292 | 0.096875 | 0.285071 | 0.130000 | 0.024323 | 0.218333 | 0.405816 | 0.400000 |
| 50% | 1.0 | 0.061250 | 0.941682 | 0.059695 | 0.996875 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.435625 | 0.415625 | 0.531919 | 0.623437 | 0.146319 | 0.480838 | 0.825000 | 0.646667 |
| 75% | 1.0 | 0.167500 | 0.998889 | 0.144853 | 0.999062 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.609995 | 0.820000 | 0.787500 | 0.917529 | 0.561323 | 0.729069 | 0.998042 | 0.882500 |
| max | 1.0 | 0.592500 | 1.000000 | 0.587088 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.999375 | 0.999375 | 1.000000 | 0.999375 | 0.999375 | 0.999375 | 1.000000 | 0.999375 |
# plot random 6 images
fig, ax = plt.subplots(nrows=2, ncols=3, sharex=False, sharey=False,figsize=(15,10))
ax = ax.flatten()
for i,j in enumerate(np.random.choice(df.shape[0], size=6, replace=False)):
img = mpimg.imread(path + df.ImageID.values[j] + '.jpg')
h, w = img.shape[:2]
coords = df.iloc[j,4:8]
ax[i].imshow(img)
ax[i].set_title(df.LabelName[j])
ax[i].add_patch(plt.Rectangle((coords[0]*w, coords[2]*h),
coords[1]*w-coords[0]*w, coords[3]*h-coords[2]*h,
edgecolor='red', facecolor='none'))
plt.tight_layout()
plt.show()
plt.figure(figsize=(22, 15))
corr = df.corr()
mask = np.triu(np.ones_like(df.corr(), dtype=np.bool))
hm = sns.heatmap(corr, annot = True,cmap='BrBG', mask = mask)
hm.set_title('Correlation Heatmap', fontdict={'fontsize':24})
plt.show()
<ipython-input-16-76ee77985b5e>:3: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations mask = np.triu(np.ones_like(df.corr(), dtype=np.bool))
Go through all images and record the shape of the image in pixels and the memory size
img_shape = []
img_size = np.zeros((df.shape[0], 1))
for i,f in enumerate(tqdm(glob.glob1(path, '*.jpg'))):
file = path+'/'+f
img = Image.open(file)
img_shape.append(f"{img.size[0]}x{img.size[1]}")
img_size[i] += os.path.getsize(file)
100%|██████████| 12966/12966 [00:01<00:00, 7548.70it/s]
Count all the different image shapes
img_shape_count = Counter(img_shape)
# create a dataframe for image shapes
img_df = pd.DataFrame(list(img_shape_count.items()), columns=['img_shape','img_count'])
img_df.head()
| img_shape | img_count | |
|---|---|---|
| 0 | 512x384 | 3782 |
| 1 | 493x384 | 1 |
| 2 | 512x341 | 2013 |
| 3 | 512x381 | 14 |
| 4 | 384x473 | 1 |
There are a ton of different image shapes. Let's narrow this down by getting a sum of any image shape that has a cout less than 100 and put that in a category called other
img_df = img_df.append({'img_shape': 'other','img_count': img_df[img_df.img_count < 100].img_count.sum()},
ignore_index=True)
Drop all image shapes
img_df.shape
(595, 2)
img_df = img_df[img_df.img_count >= 100]
Check if the count sum matches the number of images
img_df.img_count.sum() == df.shape[0]
True
img_info = pd.DataFrame(img_shape, columns = ['Img_shape'])
img_info['Label'] = df.LabelName
img_info.head()
| Img_shape | Label | |
|---|---|---|
| 0 | 512x384 | dog |
| 1 | 512x384 | dog |
| 2 | 493x384 | dog |
| 3 | 512x341 | dog |
| 4 | 512x381 | dog |
img_info.shape
(12966, 2)
Rescaling the images
Data augmentation
This is done to expand our data set so it is very diverse.
Enables model to learn generalized features instead of overfitting on the dataset
Types of augmentation data used:
Flip
Rotation
Zoom
img_df.sort_values('img_count', inplace=True)
img_df.plot(x='img_shape', y='img_count', kind='barh', figsize=(8,8), legend=False)
plt.title('Image Shape Counts')
plt.show()
# convert to megabytes
img_size = img_size / 1000
fig, ax = plt.subplots(1, 2, figsize=(15,5))
fig.suptitle('Image Size Distribution')
ax[0].hist(img_size, bins=50)
ax[0].set_title('Histogram')
ax[0].set_xlabel('Image Size (MB)')
ax[1].boxplot(img_size, vert=False, widths=0.5)
ax[1].set_title('Boxplot')
ax[1].set_xlabel('Image Size (MB)')
ax[1].set_ylabel('Images')
plt.show()
dog_cnt = img_info[img_info['Label']=='dog']['Img_shape'].value_counts()
# create a dataframe for image shapes
img_df_dog = pd.DataFrame(dog_cnt)
img_df_dog = img_df_dog.reset_index()
img_df_dog.columns = ['Img_shape', 'Count']
img_df_dog.shape
(492, 2)
img_df_dog = img_df_dog.append({'Img_shape': 'other','Count': img_df_dog[img_df_dog.Count < 100].Count.sum()},
ignore_index=True)
img_df_dog = img_df_dog[img_df_dog.Count >= 100]
img_df_dog.shape
(8, 2)
cat_cnt = img_info[img_info['Label']=='cat']['Img_shape'].value_counts()
# create a dataframe for image shapes
img_df_cat = pd.DataFrame(cat_cnt)
img_df_cat = img_df_cat.reset_index()
img_df_cat.columns = ['Img_shape', 'Count']
img_df_cat.shape
(475, 2)
img_df_cat = img_df_cat.append({'Img_shape': 'other','Count': img_df_cat[img_df_cat.Count < 100].Count.sum()},
ignore_index=True)
img_df_cat = img_df_cat[img_df_cat.Count >= 100]
img_df_cat.shape
(8, 2)
img_df_dog.sort_values('Count', inplace=True)
img_df_cat.sort_values('Count', inplace=True)
fig, axes = plt.subplots(1, 2, figsize=(15, 5), sharey=True)
fig.suptitle('Image Size')
# Dog
sns.barplot(ax=axes[0], x=img_df_dog.Img_shape, y=img_df_dog.Count, color = 'blue')
axes[0].set_title('Image Shape of Dog')
# Dog
sns.barplot(ax=axes[1], x=img_df_cat.Img_shape, y=img_df_cat.Count, color = 'blue')
axes[1].set_title('Image Shape of Cat')
plt.show()
img_df_dog.astype({'Count':'int32'})
fig, ax = plt.subplots(figsize=(12, 8))
width = 0.5
ax.bar(img_df_dog.Img_shape, img_df_dog.Count, width, label='Dog')
ax.bar(img_df_dog.Img_shape, img_df_cat.Count, width,bottom = img_df_dog.Count,
label='Cat')
ax.set_ylabel('Count')
ax.set_title('Image Shape by Label')
ax.legend()
plt.show()
!mkdir -p images/resized
%%time
# resize image and save, convert to numpy
img_arr = np.zeros((df.shape[0],128*128*3)) # initialize np.array
for i, f in enumerate(tqdm(df.ImageID)):
img = Image.open(path+f+'.jpg')
img_resized = img.resize((128,128))
img_resized.save("images/resized/"+f+'.jpg', "JPEG", optimize=True)
img_arr[i] = np.asarray(img_resized, dtype=np.uint8).flatten()
100%|██████████| 12966/12966 [00:50<00:00, 258.13it/s]
CPU times: user 43.9 s, sys: 2.5 s, total: 46.4 s Wall time: 50.2 s
Plot the resized and filtered images
# plot random 6 images
fig, ax = plt.subplots(nrows=2, ncols=3, sharex=False, sharey=False,figsize=(15,10))
ax = ax.flatten()
for i,j in enumerate(np.random.choice(df.shape[0], size=6, replace=False)):
img = mpimg.imread(path+'/resized/'+df.ImageID.values[j]+'.jpg')
h, w = img.shape[:2]
coords = df.iloc[j,4:8]
ax[i].imshow(img)
ax[i].set_title(df.iloc[j,2])
ax[i].add_patch(plt.Rectangle((coords[0]*w, coords[2]*h),
coords[1]*w-coords[0]*w, coords[3]*h-coords[2]*h,
edgecolor='red', facecolor='none'))
plt.tight_layout()
plt.show()
# encode labels
df['Label'] = (df.LabelName == 'dog').astype(np.uint8)
mkdir -p data
np.save('data/img.npy', img_arr.astype(np.uint8))
np.save('data/y_label.npy', df.Label.values)
np.save('data/y_bbox.npy', df[['XMin', 'YMin', 'XMax', 'YMax']].values.astype(np.float32))
X = np.load('data/img.npy', allow_pickle=True)
y_label = np.load('data/y_label.npy', allow_pickle=True)
y_bbox = np.load('data/y_bbox.npy', allow_pickle=True)
idx_to_label = {1:'dog', 0:'cat'} # encoder
Double check that it loaded correctly
# plot random 6 images
fig, ax = plt.subplots(nrows=2, ncols=3, sharex=False, sharey=False,figsize=(15,10))
ax = ax.flatten()
for i,j in enumerate(np.random.choice(X.shape[0], size=6, replace=False)):
coords = y_bbox[j] * 128
ax[i].imshow(X[j].reshape(128,128,3))
ax[i].set_title(idx_to_label[y_label[j]])
ax[i].add_patch(plt.Rectangle((coords[0], coords[1]),
coords[2]-coords[0], coords[3]-coords[1],
edgecolor='red', facecolor='none'))
plt.tight_layout()
plt.show()
Image Detection:
SGD Classifier:
This is a logistic regression model that uses stochastic gradient descent for optimizing the log loss
Loss function to be minimized – Log Loss
AdaBoost Classifier:
It is an adaptive model in which subsequent weak learners are modified to accommodate for instances that were misclassified by previous classifiers
Log loss to be minimized – Exponential loss functions
Gradient Boosting Classifier:
This is also a boosting algorithm that uses ensemble of week learners to produce a strong regression model
Loss function to be minimized – Log Loss
Take only a portion of the data set for training and testing purporse. We can use the entire data set later
import random
random.seed(10)
idxs = random.sample(range(0, X.shape[0]), 2500)
X_final = X[idxs]
y_label_final = y_label[idxs]
X_train, X_test, y_train, y_test_label = train_test_split(X, y_label, test_size=0.01, random_state=27)
I'm choosing SGDClassifier because the data is large and I want to be able to perform stochastic gradient descent and also its ability to early stop. With this many parameters, a model can easily overfit so it's important to try and find the point of where it begins to overfit and stop for optimal results.
Validation set size of 0.1
The log loss formula for the binary case is as follows :
$\text log loss = -\frac{1}{m}\sum^m_{i=1}\left(y_i\cdot\:\log\:\left(p_i\right)\:+\:\left(1-y_i\right)\cdot\log\left(1-p_i\right)\right) $
where $m$ is the number of data points , log is Natural Logarithm
%%time
model = SGDClassifier(loss='log', n_jobs=-1, random_state=27, learning_rate='adaptive', eta0=1e-10,
early_stopping=True, validation_fraction=0.1, n_iter_no_change=3)
# 0.2 validation TODO
model.fit(X_train, y_train)
CPU times: user 16.1 s, sys: 21.5 s, total: 37.6 s Wall time: 1min 17s
SGDClassifier(early_stopping=True, eta0=1e-10, learning_rate='adaptive',
loss='log', n_iter_no_change=3, n_jobs=-1, random_state=27)
model.n_iter_
4
Did it stop too early? Let's retrain with a few more iterations to see. Note that SGDClassifier has a parameter called validation_fraction which splits a validation set from the training data to determine when it stops.
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.1, random_state=27)
model2 = SGDClassifier(loss='log', n_jobs=-1, random_state=27, learning_rate='adaptive', eta0=1e-10)
epochs = 30
train_acc = np.zeros(epochs)
valid_acc = np.zeros(epochs)
for i in tqdm(range(epochs)):
model2.partial_fit(X_train, y_train, np.unique(y_train))
#log
train_acc[i] += np.round(accuracy_score(y_train, model2.predict(X_train)),3)
valid_acc[i] += np.round(accuracy_score(y_valid, model2.predict(X_valid)),3)
100%|██████████| 30/30 [00:23<00:00, 1.26it/s]
plt.plot(train_acc, label='train')
plt.plot(valid_acc, label='valid')
plt.title('CaDoD Training')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
del model2
Accuracy - the ratio of correction predictions to the total predictions
Mean F1 score - the harmonic means of recall and precision value
F1 score = 2precision recall / (precision + recall)
Recall is the ratio of true positives to all actual positives
Precision is the ratio of true positives to all predicted positives
F1 score metric weights both recall and precision equally and thereby a higher value of recall and precision will ensure superior performance of the model
expLog = pd.DataFrame(columns=["exp_name",
"Train Acc",
"Valid Acc",
"Test Acc",
'Train F1',
'valid F1',
'Test F1',
"Train MSE",
"Valid MSE",
"Test MSE",
])
exp_name = f"Baseline: Linear Model with 10% validation"
expLog.loc[0,:4] = [f"{exp_name}"] + list(np.round(
[accuracy_score(y_train, model.predict(X_train)),
accuracy_score(y_valid, model.predict(X_valid)),
accuracy_score(y_test_label, model.predict(X_test))],3))
<ipython-input-124-4c467d09feb9>:2: FutureWarning: Slicing a positional slice with .loc is not supported, and will raise TypeError in a future version. Use .loc with labels or .iloc with positions instead.
expLog.loc[0,:4] = [f"{exp_name}"] + list(np.round(
from sklearn.metrics import f1_score
expLog.loc[0,4:7] = list(np.round(
[f1_score(y_train, model.predict(X_train)),
f1_score(y_valid, model.predict(X_valid)),
f1_score(y_test_label, model.predict(X_test))],3))
<ipython-input-125-d0ff409833bc>:2: FutureWarning: Slicing a positional slice with .loc is not supported, and will raise TypeError in a future version. Use .loc with labels or .iloc with positions instead. expLog.loc[0,4:7] = list(np.round(
expLog
| exp_name | Train Acc | Valid Acc | Test Acc | Train F1 | valid F1 | Test F1 | Train MSE | Valid MSE | Test MSE | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Baseline: Linear Model with 10% validation | 0.584 | 0.574 | 0.554 | 0.639 | 0.634 | 0.603 | NaN | NaN | NaN |
Validation set size of 0.2
%%time
model_2 = SGDClassifier(loss='log', n_jobs=-1, random_state=27, learning_rate='adaptive', eta0=1e-10,
early_stopping=True, validation_fraction=0.2, n_iter_no_change=3)
# 0.2 validation TODO
model_2.fit(X_train, y_train)
CPU times: user 14.2 s, sys: 16.8 s, total: 31 s Wall time: 58.8 s
SGDClassifier(early_stopping=True, eta0=1e-10, learning_rate='adaptive',
loss='log', n_iter_no_change=3, n_jobs=-1, random_state=27,
validation_fraction=0.2)
exp_name = f"Baseline: Linear Model with 20% validation"
expLog.loc[1,:4] = [f"{exp_name}"] + list(np.round(
[accuracy_score(y_train, model_2.predict(X_train)),
accuracy_score(y_valid, model_2.predict(X_valid)),
accuracy_score(y_test_label, model.predict(X_test))],3))
<ipython-input-128-30a8eca74b29>:2: FutureWarning: Slicing a positional slice with .loc is not supported, and will raise TypeError in a future version. Use .loc with labels or .iloc with positions instead.
expLog.loc[1,:4] = [f"{exp_name}"] + list(np.round(
expLog.loc[1,4:7] = list(np.round(
[f1_score(y_train, model_2.predict(X_train)),
f1_score(y_valid, model_2.predict(X_valid)),
f1_score(y_test_label, model_2.predict(X_test))],3))
<ipython-input-129-82c65a38772f>:1: FutureWarning: Slicing a positional slice with .loc is not supported, and will raise TypeError in a future version. Use .loc with labels or .iloc with positions instead. expLog.loc[1,4:7] = list(np.round(
expLog
| exp_name | Train Acc | Valid Acc | Test Acc | Train F1 | valid F1 | Test F1 | Train MSE | Valid MSE | Test MSE | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Baseline: Linear Model with 10% validation | 0.584 | 0.574 | 0.554 | 0.639 | 0.634 | 0.603 | NaN | NaN | NaN |
| 1 | Baseline: Linear Model with 20% validation | 0.581 | 0.551 | 0.554 | 0.664 | 0.644 | 0.679 | NaN | NaN | NaN |
clf_GB = GradientBoostingClassifier(n_estimators=100)
clf_GB.fit(X_train, y_train)
GradientBoostingClassifier()
exp_name = f"Baseline: Gradient Boosting"
expLog.loc[2,:4] = [f"{exp_name}"] + list(np.round(
[accuracy_score(y_train, clf_GB.predict(X_train)),
accuracy_score(y_valid, clf_GB.predict(X_valid)),
accuracy_score(y_test_label, clf_GB.predict(X_test))],3))
<ipython-input-133-b303cb74ff73>:2: FutureWarning: Slicing a positional slice with .loc is not supported, and will raise TypeError in a future version. Use .loc with labels or .iloc with positions instead.
expLog.loc[2,:4] = [f"{exp_name}"] + list(np.round(
exp_name = f"Baseline: Gradient Boosting"
expLog.loc[2,4:7] = list(np.round(
[f1_score(y_train, clf_GB.predict(X_train)),
f1_score(y_valid, clf_GB.predict(X_valid)),
f1_score(y_test_label, clf_GB.predict(X_test))],3))
<ipython-input-134-dd4558d1c678>:2: FutureWarning: Slicing a positional slice with .loc is not supported, and will raise TypeError in a future version. Use .loc with labels or .iloc with positions instead. expLog.loc[2,4:7] = list(np.round(
expLog
| exp_name | Train Acc | Valid Acc | Test Acc | Train F1 | valid F1 | Test F1 | Train MSE | Valid MSE | Test MSE | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Baseline: Linear Model with 10% validation | 0.584 | 0.574 | 0.554 | 0.639 | 0.634 | 0.603 | NaN | NaN | NaN |
| 1 | Baseline: Linear Model with 20% validation | 0.581 | 0.551 | 0.554 | 0.664 | 0.644 | 0.679 | NaN | NaN | NaN |
| 2 | Baseline: Gradient Boosting | 0.774 | 0.614 | 0.608 | 0.799 | 0.661 | 0.667 | NaN | NaN | NaN |
y_pred_label = model.predict(X_test)
y_pred_label_proba = model.predict_proba(X_test)
fig, ax = plt.subplots(nrows=2, ncols=5, sharex=False, sharey=False,figsize=(15,6))
ax = ax.flatten()
for i in range(10):
img = X_test[i].reshape(128,128,3)
ax[i].imshow(img)
ax[i].set_title("Ground Truth: {0} \n Prediction: {1} | {2:.2f}".format(idx_to_label[y_test_label[i]],
idx_to_label[y_pred_label[i]],
y_pred_label_proba[i][y_pred_label[i]]),
color=("green" if y_pred_label[i]==y_test_label[i] else "red"))
plt.tight_layout()
plt.show()
This is used for boundary Detection
Loss functions to be minimized - Mean Square Error (MSE)
The mean square function formula is as follows : $ \text{MSE}({\mathbf{\theta}}; \mathbf{X}) = \dfrac{1}{m} \sum\limits_{i=1}^{m}{( \hat{y_i} - y_i)^2} $
where $m$ is the number of data points , $\hat{y_i}$ is the predicted value
X_final = X[idxs]
y_bbox_final = y_bbox[idxs]
X_train, X_test, y_train, y_test = train_test_split(X, y_bbox, test_size=0.01, random_state=27)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.1, random_state=27)
%%time
from sklearn.linear_model import LinearRegression
# TODO closed loop solution, could use Lasso Ridge
model = LinearRegression() #fill in
model.fit(X_train, y_train)
# might take a few minutes to train
#CPU times: user 1h 26min 40s, sys: 5min 53s, total: 1h 32min 34s
#Wall time: 17min 24s
CPU times: user 45min 35s, sys: 8min 54s, total: 54min 30s Wall time: 18min 21s
LinearRegression()
Mean Square Error:
expLog.iloc[0,7:] = list(np.round([mean_squared_error(y_train, model.predict(X_train)),
mean_squared_error(y_valid, model.predict(X_valid)),
mean_squared_error(y_test, model.predict(X_test))],3))
expLog.iloc[1,7:] = list(np.round([mean_squared_error(y_train, model.predict(X_train)),
mean_squared_error(y_valid, model.predict(X_valid)),
mean_squared_error(y_test, model.predict(X_test))],3))
expLog.iloc[2,7:] = list(np.round([mean_squared_error(y_train, model.predict(X_train)),
mean_squared_error(y_valid, model.predict(X_valid)),
mean_squared_error(y_test, model.predict(X_test))],3))
expLog
| exp_name | Train Acc | Valid Acc | Test Acc | Train F1 | valid F1 | Test F1 | Train MSE | Valid MSE | Test MSE | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Baseline: Linear Model with 10% validation | 0.584 | 0.574 | 0.554 | 0.639 | 0.634 | 0.603 | 0.0 | 0.036 | 0.035 |
| 1 | Baseline: Linear Model with 20% validation | 0.581 | 0.551 | 0.554 | 0.664 | 0.644 | 0.679 | 0.0 | 0.036 | 0.035 |
| 2 | Baseline: Gradient Boosting | 0.774 | 0.614 | 0.608 | 0.799 | 0.661 | 0.667 | 0.0 | 0.036 | 0.035 |
y_pred_bbox = model.predict(X_test)
fig, ax = plt.subplots(nrows=2, ncols=3, sharex=False, sharey=False,figsize=(15,10))
ax = ax.flatten()
for i,j in enumerate(np.random.choice(X_test.shape[0], size=6, replace=False)):
img = X_test[j].reshape(128,128,3)
coords = y_pred_bbox[j] * 128
ax[i].imshow(img)
ax[i].set_title("Ground Truth: {0} \n Prediction: {1} | {2:.2f}".format(idx_to_label[y_test_label[j]],
idx_to_label[y_pred_label[j]],
y_pred_label_proba[j][y_pred_label[j]]),
color=("green" if y_pred_label[j]==y_test_label[j] else "red"))
ax[i].add_patch(plt.Rectangle((coords[0], coords[1]),
coords[2]-coords[0], coords[3]-coords[1],
edgecolor='red', facecolor='none'))
plt.tight_layout()
plt.show()
Implement a Homegrown Logistic Regression model. Extend the loss function from CXE to CXE + MSE, i.e., make it a complex multitask loss function the resulting model predicts the class and bounding box coordinates at the same time.
X_train, X_test, y_train, y_test_label = train_test_split(X, y_label, test_size=0.01, random_state=27)
from sklearn.preprocessing import StandardScaler
X_train, X_test, y_train, y_test_label = train_test_split(X_final, y_label_final, test_size=0.01, random_state=27)
#s = StandardScaler()
#s.fit_transform(X_train)
#s.transform(X_test)
# Source - HW 7
class LogisticRegressionHomegrown(object):
def __init__(self):
"""
Constructor for the homgrown Logistic Regression
Args:
None
Return:
None
"""
self.coef_ = None
self.intercept_ = None
self._theta = None
self.history = {"cost": [],
"acc": [],
"val_cost":[],
"val_acc": [],
'val_prob':[],
"prob":[]}
def _grad(self, X, y):
"""
Calculates the gradient of the Logistic Regression
objective function
Args:
X(ndarray): train objects
y(ndarray): answers for train objects
Return:
grad(ndarray): gradient
"""
n = X.shape[0]
scores = self._predict_raw(X)
exp_scores = np.exp(scores)
probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
probs[range(n),y] -= 1
gradient = np.dot(X.T, probs) / n
return gradient
def _gd(self, X, y, max_iter, alpha, X_val, y_val):
for i in range(max_iter):
metrics = self.score(X, y)
self.history["cost"].append(metrics["cost"])
self.history["acc"].append(metrics["acc"])
self.history["prob"].append(metrics["prob"])
if X_val is not None:
metrics_val = self.score(X_val, y_val)
self.history["val_cost"].append(metrics_val["cost"])
self.history["val_acc"].append(metrics_val["acc"])
self.history["val_prob"].append(metrics_val["prob"])
# gradient
grad = self._grad(X, y)
self._theta -= alpha * grad
def fit(self, X, y, max_iter=1500, alpha=0.05, val_data=None):
X = np.c_[np.ones(X.shape[0]), X]
if val_data is not None:
X_val, y_val = val_data
X_val = np.c_[np.ones(X_val.shape[0]), X_val]
else:
X_val = None
y_val = None
if self._theta is None:
self._theta = np.random.rand(X.shape[1], len(np.unique(y)))
self._gd(X, y, max_iter, alpha, X_val, y_val)
self.intercept_ = self._theta[0]
self.coef_ = self._theta[1:]
def score(self, X, y):
n = X.shape[0]
scores = self._predict_raw(X)
exp_scores = np.exp(scores)
probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
corect_logprobs = -np.log(probs[range(n),y])
data_loss = np.sum(corect_logprobs) / n
# predictions
pred = np.argmax(scores, axis=1)
# accuracy
acc = accuracy_score(y, pred)
metrics = {"acc": acc, "cost": data_loss, 'prob':corect_logprobs}
return metrics
def _predict_raw(self, X):
if X.shape[1] == len(self._theta):
scores = np.dot(X, self._theta)
else:
scores = np.dot(X, self.coef_) + self.intercept_
return scores
def predict(self, X):
"""
Predicts class for each object in X
Args:
X(ndarray): objects
Return:
pred(ndarray): class for each object
"""
# get scores for each class
scores = self._predict_raw(X)
pred = np.argmax(scores, axis=1)
return pred
X_train, X_test, y_train, y_test_label = train_test_split(X_final, y_label_final, test_size=0.01, random_state=27)
np.random.seed(42)
if np.max(X_train) > 4.:
X_train = X_train.astype(np.float32) / 255.
if np.max(X_test) > 4.:
X_test = X_test.astype(np.float32) / 255.
y_train=y_train.astype(int)
y_test=y_test.astype(int)
class FixedLogisticRegressionHomegrown(LogisticRegressionHomegrown):
def __init__(self):
# call the constructor of the parent class
super(FixedLogisticRegressionHomegrown, self).__init__()
#==================================================#
# Place your code here #
# Redefine a method which causes the error #
# Hint: only one method #
#==================================================#
#==================================================#
# Your code starts here #
#==================================================#
def _predict_raw(self, X):
# check whether X has appended bias feature or not
if X.shape[1] == len(self._theta):
scores = np.dot(X, self._theta)
else:
scores = np.dot(X, self.coef_) + self.intercept_
# normalize raw scores to prevent overflow
scores -= np.max(scores, axis=1, keepdims=True)
return scores
#==================================================#
# Your code ends here #
# Please don't add code below here #
#==================================================#
model_lr_homegrown_fixed = FixedLogisticRegressionHomegrown()
model_lr_homegrown_fixed.fit(X_train[:1000], y_train[:1000], max_iter=2000, alpha=0.05, val_data=(X_test, y_test_label))
plt.figure(figsize=(20, 8))
plt.suptitle("Homegrown Logistic Regression")
plt.subplot(121)
plt.plot(model_lr_homegrown_fixed.history["cost"], label="Train")
plt.plot(model_lr_homegrown_fixed.history["val_cost"], label="Test")
plt.legend(loc="upper left")
plt.xlabel("Iteration")
plt.ylabel("Loss")
plt.xlim([0,2000])
plt.subplot(122)
plt.plot(model_lr_homegrown_fixed.history["acc"], label="Train")
plt.plot(model_lr_homegrown_fixed.history["val_acc"], label="Test")
plt.legend(loc="upper left")
plt.xlabel("Iteration")
plt.xlim([0,2000])
plt.ylabel("Accuracy");
y_pred_test = model_lr_homegrown_fixed.predict(X_test)
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test_label, y_pred_test)
print(acc)
0.68
X_train, X_test, y_train, y_test_label = train_test_split(X, y_label, test_size=0.01, random_state=27)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(X, y_bbox, test_size=0.01, random_state=27)
from sklearn.metrics import log_loss
def comb_MSE_CXE(x,y1,y2):
sum_error = []
for i in range(0, len(x)):
y1_pred = model.predict([x[i]])
#print(y1_pred)
y2_pred = y2[i]
mse = mean_squared_error(y1[i], y1_pred[0])
cxe = y2_pred
sum1 = cxe + mse
sum_error.append(sum1)
#print(mse, cxe)
return sum(filter(lambda x: x != float('inf'), sum_error))/len(x)
comb_MSE_CXE(X_train[:1000],y_train_b[:1000], np.mean(model_lr_homegrown_fixed.history['prob'][:1000], axis = 1))
61.58699290572804
Stage 1:
We ran SGD (10% and 20% validation set), Adaboost and Gradient boost in a small data set to narrow of model search space
From the expLog table, we see the Gradient boost has a better accuracy than Adaboost
We will use this model to train our entire data set
Number of experiements conducted - 4 for image classification and 1 for boundary detection
Stage 2:
Based on the results from the above experiment, we conducted SGD (10% and 20% validation set) and Gradient boost on the entire image set
Gradient boost performed well on train, validation and testing set (both along accuracy and F1 score), hence we will be using this as our baseline model going forward
Number of experiments conducted - 3 for image classification and 1 for boundary detection
The log loss formula for the binary case is as follows :
$\text log loss = -\frac{1}{m}\sum^m_{i=1}\left(y_i\cdot\:\log\:\left(p_i\right)\:+\:\left(1-y_i\right)\cdot\log\left(1-p_i\right)\right) $
where $m$ is the number of data points , log is Natural Logarithm
We implemented a homegrown logistic regression. In this case, we used only a portion (2500 images) to build a model. This model gives the log loss for each images for each epochs.
Using this loss function, along with the MSE of the linear regression, we created a function that gives the sum of CXE and MSE for each images. We then take the mean of all the images which gives the final loss of the model
Computer vision is a rapidly growing field and image classification has become a primary focus of many fields in the present world. Our project fouses on classifying images as either cat of dog and also finding the boundries of the image.
We have used SGD classifier, adaboost and gradient boost to train our image data set. From the above observation, we see that gradient boosting has a very high accuracy in comparison to the other models, and therefore we will be using gradient boosting as our baseline model for image detection moving forward. F1 score, which is a combination of precision and recall, and accuracy scores are considerably higher for gradient boosting (vs other models), which helped us lean towards selecting this model as our baseline.
And we will be using linear regression for boundary detection as the MSE value for train, validation, and test dataset are very small.
Our next steps will be as follows:
Tune hyperparameters using GridSearchCV/RandomSearchCV
Image boundary detection can be improved further by using Ridge or Lasso regression
Building a CNN model to improve accuracy of image detection
Perform a t-test to compare the baseline model with the improved model to make sure our improved model performs significantly better than the baseline model